{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 12 数据拆分\n",
    "\n",
    "`pandas.Series.str.slice(start, stop)`: 和切片一样，左开右闭。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>A</th>\n",
       "      <th>B</th>\n",
       "      <th>C</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <td>0</td>\n",
       "      <td>1.527933</td>\n",
       "      <td>1.000667</td>\n",
       "      <td>-0.152676</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>1</td>\n",
       "      <td>-1.463362</td>\n",
       "      <td>-0.409950</td>\n",
       "      <td>0.845998</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>2</td>\n",
       "      <td>-0.338109</td>\n",
       "      <td>1.410190</td>\n",
       "      <td>0.413910</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>3</td>\n",
       "      <td>0.397799</td>\n",
       "      <td>-0.346103</td>\n",
       "      <td>0.956951</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <td>4</td>\n",
       "      <td>0.851239</td>\n",
       "      <td>-0.207375</td>\n",
       "      <td>0.234587</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          A         B         C\n",
       "0  1.527933  1.000667 -0.152676\n",
       "1 -1.463362 -0.409950  0.845998\n",
       "2 -0.338109  1.410190  0.413910\n",
       "3  0.397799 -0.346103  0.956951\n",
       "4  0.851239 -0.207375  0.234587"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.DataFrame(data=np.random.randn(5, 3), columns=list(\"ABC\"))\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    1.527933\n",
       "1    1.463362\n",
       "2    0.338109\n",
       "3    0.397799\n",
       "4    0.851239\n",
       "Name: A, dtype: float64"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df['A'] = df.A.abs()\n",
    "df.A"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0     1.5279332784870372\n",
       "1     1.4633617791476639\n",
       "2    0.33810944255895947\n",
       "3     0.3977993999940166\n",
       "4     0.8512387709234015\n",
       "Name: A, dtype: object"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df['A'] = df.A.astype(str)\n",
    "df['A']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    1.5\n",
       "1    1.4\n",
       "2    0.3\n",
       "3    0.3\n",
       "4    0.8\n",
       "Name: A, dtype: object"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df['A'].str.slice(0, 3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`pandas.Series.str.split(pat, n=-1, expand=False)`: 按照分隔符拆分。\n",
    "\n",
    "- pat: 分隔符，默认为空格\n",
    "- n: 分割n+1列，-1代表所有列\n",
    "- expand: 是否展开为数据框，默认False, 一般设置为True"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`pandas.Series.dt`: 按照时间属性拆分。dt对象有如下多种属性：\n",
    "\n",
    "- second\n",
    "- minute\n",
    "- hour\n",
    "- day\n",
    "- month\n",
    "- year\n",
    "- weekday: 0是星期一，5是星期六, 6是星期日"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " `pandas.Series.str.contains(pat, case=True, na=NaN)`: 关键词抽取\n",
    " \n",
    " - pat: 要包含的字符\n",
    " - case: 英文是否忽略大小写\n",
    " - na: 缺失值的填充方式，在过滤的场景下需设置为False，也就是缺失值返回False"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "比如上面的 df，可以使用`df.A.isnull()`函数抽取空值。 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    False\n",
       "1    False\n",
       "2    False\n",
       "3    False\n",
       "4    False\n",
       "Name: A, dtype: bool"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.A.isnull()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "数据范围抽取采用比较条件(>, <, ==, >=, <=, !=)。 还可以使用 `between(left, right, inclusive=True)`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0     True\n",
       "1    False\n",
       "2     True\n",
       "3    False\n",
       "4    False\n",
       "Name: B, dtype: bool"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.B > 0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0     True\n",
       "1    False\n",
       "2     True\n",
       "3    False\n",
       "4    False\n",
       "Name: B, dtype: bool"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.B.between(-0.01, 2, True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 13 数据标准化\n",
    "\n",
    "0-1 标准化是最常用的数据标准化的方法。方法为：向量中的每个值与所在向量中的最小值得差，除以所在向量中的最大值和最小值得差。\n",
    "\n",
    "`x* = (x - x.min) / (x.max - x.min)`。这样处理完之后最大值变成1，最小值变成0，其他介于0-1之间了。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
